Abstract

In this paper we discuss the methodological issues of using a class 
of neural networks called Mixture Density Networks (MDN) for discrim- 
inant analysis. MDN models have the advantage of having a rigorous 
probabilistic interpretation, and they have proven to be a viable altern- 
ative as a classification procedure in discrete domains. We will address 
both the classification and interpretive aspects of discriminant analysis, 
and compare the approach to the traditional method of linear discrimin- 
ants as implemented in standard statistical packages. We show that the 
MDN approach adopted performs well in both aspects. Many of the ob- 
servations made are not restricted to the particular case at hand, and are 
applicable to most applications of discriminant analysis in educational 
research.

1 Introduction



Artificial neural networks (Haykin, 1994) can be viewed as a family of nonlin- 
ear models used for empirical regression and classification modeling. Such net- 
works have been successfully used in various fields for nonlinear modeling and 
approximation, for example in speech recognition (Kohonen, 1995), expert
systems (Gallant, 1993), and machine vision (Hinton and Sejnowski, 1983). More
recently they have also been applied for data analysis in various financial do- 
mains (Baestaens et al., 1994). The increasing importance of neural networks 
as nonlinear models is witnessed by the fact that currently many of the stand- 
ard statistical software packages include feed-forward neural network modeling 
in their tool box. Similarly, the recent multidisciplinary research efforts in the
field of “Knowledge Discovery in Databases” (Fayyad et al., 1996) quite
frequently use neural network techniques.

Neural network models are composed of a large number of individual com- 
putational elements called nodes, which are linked together to form a structure 
(called the architecture). This structure typically classifies the neural network 
type: feed-forward neural networks are layered structures, whereas recurrent 
neural networks introduce feedback links down the network. The nodes are
associated with a nonlinear function $y = f(x)$, and the links have associated
weights $w$. The computation is organized by combining the weights with inputs,
i.e., multiplying the input value $x_i$ by the corresponding value $w_i$, which
is then given as argument for $f$. Thus the computation for a single node is
given by

$$y = f\Big(\sum_i w_i x_i\Big).$$

Intuitively, the weights are the parameters of the model, and the learning of a 
neural network from a data sample means parameter estimation. The descrip- 
tion above is a gross simplification of this very rich set of models, but captures 
the essential idea. 
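
As a minimal illustration of the single-node computation above, the following Python sketch evaluates $y = f(\sum_i w_i x_i)$ for one node. The function names and the two activation functions (tanh and the logistic function) are assumptions chosen for illustration; the text does not prescribe a particular $f$.

```python
import math


def logistic(a):
    """Logistic activation, one common (assumed) choice for f."""
    return 1.0 / (1.0 + math.exp(-a))


def node_output(weights, inputs, activation=math.tanh):
    """Compute y = f(sum_i w_i * x_i) for a single node."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)


# Hypothetical weights and inputs: 0.5 * 2.0 + (-0.25) * 4.0 = 0.0
y_tanh = node_output([0.5, -0.25], [2.0, 4.0])
y_logistic = node_output([0.5, -0.25], [2.0, 4.0], activation=logistic)
```

Learning, in this picture, amounts to choosing the values in `weights` from data; the sketch only shows the forward computation.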

Giving an introduction to the various different types of neural networks and 
their related learning algorithms (parameter estimation methods) is outside the 
scope of this paper, and an interested reader should consult one of the excellent 
text books available (Bishop, 1995; Haykin, 1994; Ripley, 1996), or many of 
the introductions to neural networks from a statistical perspective (see e.g., 
(Cheng and Titterington, 1994; Ripley, 1993)). Two reviews of Hinton (Hinton, 
1989; Hinton, 1992) are also valuable. We would like to point out that neural
network model analysis can be based on various different viewpoints, from
particle physics and statistical modeling to biological simulation or automata
theory. Our approach to neural networks is based on seeing them as probabilistic
models, which, as opposed to some other views, gives us a rigorous underlying
theoretical framework for their analysis and use. In this sense we conform to
the work presented in (Bishop, 1995; MacKay, 1992; MacKay, 1992a; MacKay,
1992b; Jordan and Jacobs, 1994).

In spite of their widespread use for data modeling in economics, physics, 
computer science and pattern recognition, neural networks are almost unknown 
in the educational sciences community. This is perhaps partly caused by the 
unfamiliar terminology associated with neural networks due to their origin 
from cognitive and neurosciences, partly by the lack of demonstrations of the 
applicability of the methods for educational data. Since many neural network 
models assume continuous input variables (outputs), educational data sets, such 
as questionnaire data, have not been modeled due to their discrete nature. In 
this paper we focus on the use of a particular class of neural networks called the 
Mixture Density Networks (Bishop, 1994) in the analysis of educational data. 
This intuitively very appealing neural network model family is introduced in 
Section 2, and can be understood as an implementation of a particular subclass 
of finite mixture models (Titterington et al., 1985).

Klecka (1981) defines discriminant analysis to be a set of statistical
techniques to study the differences between two or more groups of objects with 
respect to several variables simultaneously. In educational research discrimin- 
ant analysis is used for two different purposes: 

• Interpretation of group differences — i.e., to find out if one is able to 
discriminate between the groups on the basis of some set of characterist- 
ics. In addition one might be interested in finding which characteristics 
are the most powerful discriminators. 

• Classification — i.e., predicting the group membership of new data for 
which the group information is not known. 

In Section 5 we will discuss the use of Mixture Density Networks for the 
classification problem formulated in Section 3. Instead of just presenting clas- 
sification accuracy information, we want to put the results in perspective, and 
compare them to the ones achieved by the traditional linear discriminant
analysis (McLachlan, 1992). We would like to point out that the purpose here is
not to demonstrate the superiority of the Mixture Density Network approach
in classification accuracy (although, due to the power of the underlying mixture
model language, this is often the case). Rather we would like to
discuss methodological issues for both constructing the classifiers and for eval- 
uating their quality, if one moves from the linear discriminant framework to 
neural network approaches. Many of the concerns raised are well-known in 
the computational intelligence community (Bezdek, 1994), but seem to be very 
seldom discussed in the educational quantitative methodology literature. 

Finally we will address the interpretive side of the discriminant analysis. 
Since any model that predicts well has captured an underlying regularity in the 
data, an interesting question is whether that information can be extracted from 
the model representation. Traditional neural networks suffer from the fact that 
the language of weight matrix and node functions is not easily interpretable, 
and results in a “black box” approach, which clearly is not useful in most 
cases for educational data analysis. In Section 6 we will briefly illustrate that 
this is not the case for Mixture Density Networks (due to their probabilistic 
semantics), and discuss the interesting explorative possibilities offered by the 
MDN models. 

We aim at keeping the technical level of our discussion as moderate
as possible, and focus on discussing the methodological issues using typical
example data sets, one of them being from a recent educational study. Readers
not interested in the technical details of the MDN network models can browse 
Sections 2 and 3, and go directly to the description of the problem domains 
(Section 4), from which the data samples for the experiments were taken. 



2 Mixture Density Networks 

Mixture Density Networks (Bishop, 1994) are a class of neural networks which can
be used to represent general conditional probability densities $p(t \mid d)$ by
considering a (semi)parametric model for the distribution of $t$ whose parameters
are determined by the outputs of a feed-forward neural network, which takes $d$ as
its input. Thus the MDN models are actually a combined neural network
structure and a density model (for more details see the discussion in (Bishop, 1995)).
Provided we consider a sufficiently flexible network, and a sufficiently general
density model, we have a framework for approximating arbitrary conditional
distributions.

Typical choices for a parametric model are a single Gaussian or a linear
combination of a fixed set of kernel functions. A very general framework for
modeling unconditional distributions can be based on the set of discrete finite
mixtures (Everitt and Hand, 1981; Titterington et al., 1985), where the joint
domain probability distribution is approximated as a weighted sum of mixture
distributions.

Let $X_1, \ldots, X_m$ be a set of $m$ ($m > 1$) discrete (random) variables, and
let $d \in D$ be a sample from the joint distribution of the variables $X_1, \ldots, X_m$.
Then the finite mixture distribution for $d$ can be written as ($K > 1$)

$$p(d) = p(X_1 = x_1, \ldots, X_m = x_m) = \sum_{k=1}^{K} \Big( p(Y = y_k)\, p(X_1 = x_1, \ldots, X_m = x_m \mid Y = y_k) \Big), \quad (1)$$

where $Y$ denotes a latent clustering random variable, the values of which are
not given in the data $D$, and $K$ is the number of possible values of $Y$.

Thus in finite mixture models the problem domain probability distribution
is approximated by a weighted sum of mixture distributions, where each mixture
component $p(X_1 = x_1, \ldots, X_m = x_m \mid Y = y_k)$ models one data producing
mechanism. If the variables $X_1, \ldots, X_m$ are independent, given the value of the
clustering variable $Y$, equation (1) becomes

$$p(d) = \sum_{k=1}^{K} \Big( p(Y = y_k) \prod_{i=1}^{m} p(X_i = x_i \mid Y = y_k) \Big). \quad (2)$$

For the Mixture Density Networks considered here this independence assumption
holds and consequently computation uses equation (2).
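
To make equation (2) concrete, the following Python sketch evaluates the mixture probability of one discrete data vector. The two-cluster, two-variable probability tables are hypothetical values chosen only for illustration, and the function and variable names are assumptions rather than part of the original work.

```python
def mixture_probability(d, cluster_probs, cond_probs):
    """Evaluate equation (2): sum_k P(Y=y_k) * prod_i P(X_i = d[i] | Y=y_k).

    cluster_probs[k]  -- P(Y = y_k)
    cond_probs[k][i]  -- dict mapping each value of X_i to P(X_i = value | Y = y_k)
    """
    total = 0.0
    for p_k, tables in zip(cluster_probs, cond_probs):
        term = p_k
        for x_i, table in zip(d, tables):
            term *= table[x_i]
        total += term
    return total


# Hypothetical model: K = 2 clusters, m = 2 binary variables.
cluster_probs = [0.5, 0.5]
cond_probs = [
    [{0: 0.75, 1: 0.25}, {0: 0.5, 1: 0.5}],  # tables for cluster y_1
    [{0: 0.25, 1: 0.75}, {0: 0.5, 1: 0.5}],  # tables for cluster y_2
]

p = mixture_probability((0, 1), cluster_probs, cond_probs)
all_vectors = [(a, b) for a in (0, 1) for b in (0, 1)]
total_mass = sum(mixture_probability(v, cluster_probs, cond_probs) for v in all_vectors)
```

Summing the result over all possible data vectors gives one, which is a quick sanity check that the tables define a proper distribution.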

A finite mixture model partitions the data into $K$ clusters. This partitioning
can be modeled by introducing for each data vector $d_j$ an unobserved latent
variable $Z_j$, the value of which gives the index of the cluster that vector $d_j$
belongs to. We can now think of a vector $Z = (z_1, \ldots, z_N)$, consisting of the values
of the latent variables $Z_1, \ldots, Z_N$, as a random sample from the distribution
of $Y$, just as $D$ is a random sample from the joint distribution of $X_1, \ldots, X_m$.
However, for technical reasons it is more convenient to consider each value $z_j$
as a vector of cluster indicator variable values, $z_j = (z_{j1}, \ldots, z_{jK})$, where

$$z_{jk} = \begin{cases} 1 & \text{if } d_j \text{ is sampled from } p(\cdot \mid Y = y_k), \\ 0 & \text{otherwise.} \end{cases}$$
Finite mixtures as defined in equation (2) form a generic model family, as we
still have to fix the cluster distribution $p(Y)$ and the intra-class conditional
distributions $p(X_i \mid Y = y_k)$.¹ The most commonly used component functions in the
literature are the univariate normal distributions (see e.g., (Titterington et al.,
1985)). In educational domains the variables are usually discrete, thus we can
drop the assumption of the form of the distribution. For the univariate case a
binomial model could be used, but for the general case with $m > 1$ a natural
choice is the multivariate generalization of the binomial distribution called the
multinomial distribution

$$p(c \mid \theta) = \frac{N'!}{\prod_{j=1}^{n_i} c_j!} \prod_{j=1}^{n_i} \theta_j^{c_j},$$

where $c = (c_1, \ldots, c_{n_i})$ is the vector of counts of the number of observations
of each value of $X_i$. In addition, $\sum_{j=1}^{n_i} \theta_j = 1$ and
$\sum_{j=1}^{n_i} c_j = N'$ ($N'$ is the total number of observations). Since we are interested
in the data distribution, i.e., $p(X_i \mid Y = y_k)$, the multinomial distribution form
simply reduces to a product of probabilities $\theta_j$. Analogously we assume that
the cluster distribution $p(Y)$ is multinomial. Thus in order to get a model,
we need to fix the number of the mixing distributions ($K$), and determine the
values of the model parameters. For technical reasons it will be convenient to
make a notational distinction between the mixture weight parameters and the
parameters of the intra-class conditional distributions, i.e., $\Theta = (\alpha, \Phi)$, $\Theta \in \Omega$,
where $\alpha = (\alpha_1, \ldots, \alpha_K)$ and $\Phi = (\Phi_{11}, \ldots, \Phi_{1m}, \ldots, \Phi_{K1}, \ldots, \Phi_{Km})$, with the
denotations $\alpha_k = P(Y = y_k)$ and $\Phi_{ki} = (\phi_{ki1}, \ldots, \phi_{kin_i})$, where
$\phi_{kil} = P(X_i = x_{il} \mid Y = y_k)$.

Since our estimation of the network parameters will be Bayesian (Bernardo
and Smith, 1994), we need to fix the prior distributions for the parameters. The
family of Dirichlet (multivariate Beta) densities is conjugate to the family of
multinomials; therefore we assume that the prior distributions of the parameters
are $(\alpha_1, \ldots, \alpha_K) \sim \mathrm{Di}(\mu_1, \ldots, \mu_K)$ and
$(\phi_{ki1}, \ldots, \phi_{kin_i}) \sim \mathrm{Di}(\sigma_{ki1}, \ldots, \sigma_{kin_i})$
($1 \le k \le K$, $1 \le i \le m$), where

$$\{\mu_k, \sigma_{kil} \mid 1 \le k \le K;\ 1 \le i \le m;\ 1 \le l \le n_i\}$$

are called the hyperparameters of the corresponding distributions. Assuming
that the parameter vectors $\alpha$ and $\Phi_{ki}$ are independent, the joint prior
distribution of all the parameters can be expressed as

$$\mathrm{Di}(\mu_1, \ldots, \mu_K) \prod_{k=1}^{K} \prod_{i=1}^{m} \mathrm{Di}(\sigma_{ki1}, \ldots, \sigma_{kin_i}).$$

¹ Here we consider only mixtures in which all the component distributions come from the
same parametric class.
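
The practical consequence of conjugacy can be sketched in a few lines of Python: with a Dirichlet prior and multinomial counts, the posterior is again Dirichlet with the observed counts added to the hyperparameters, and the predictive probability of each value is the normalized posterior parameter. The uniform hyperparameter value 1.0 and the Likert-scale counts below are assumptions for illustration only; the text does not state which hyperparameter values were used.

```python
def dirichlet_posterior(prior, counts):
    """Posterior Dirichlet parameters: prior hyperparameters plus observed counts."""
    return [a + c for a, c in zip(prior, counts)]


def predictive_probabilities(prior, counts):
    """Probability of each value for the next observation (posterior mean)."""
    posterior = dirichlet_posterior(prior, counts)
    total = sum(posterior)
    return [a / total for a in posterior]


# Hypothetical example: one Likert item with values 1..5 observed 10 times.
prior = [1.0] * 5           # assumed uniform hyperparameters
counts = [2, 0, 3, 4, 1]    # assumed counts of answers 1, 2, 3, 4, 5
probs = predictive_probabilities(prior, counts)
```

Note that the value never observed (answer 2) still receives a nonzero probability, which is one reason a prior is convenient for small questionnaire samples.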

The finite mixture model family is universal in the sense that it can
approximate any distribution arbitrarily closely as long as a sufficient number of
components is used. Unfortunately such generality typically also implies that
parameter estimation can become computationally inefficient. Therefore the
networks used in our experiments will be a special subclass of the general
Mixture Density Networks. This class follows from equation (2) when we remove
the latency of $Y$ and assume that one of the variables gives us
the partitioning of the data (for notational simplicity we will assume that it is
always $X_m$). These new models correspond to a specific subclass of the more
general case, thus the joint probability distribution for a data vector $d$ can be
written as

$$p(d) = p(X_1 = x_1, \ldots, X_{m-1} = x_{m-1}, X_m = x_m) = \sum_{j=1}^{n_m} \Big( P(X_m = j) \prod_{i=1}^{m-1} P(X_i = x_i \mid X_m = j) \Big). \quad (3)$$

3 Classification problem 

Let us now return to the classification problem. The purpose of a classification 
procedure is to predict the value of a single class variable of a new partially 
observed data vector, based on the model (e.g., discriminant functions) con- 
structed from the sample. 

Given the data sample $D$, MDN predictions are based on the conditional
distribution $p(d \mid D)$ of a new test vector $d$, where

$$p(d \mid D) = \frac{p(d, D)}{p(D)}. \quad (4)$$

The classification problem can now be restated: given the values of the
variables $X_1, \ldots, X_{m-1}$ and a data sample $D$, predict the value of variable $X_m$.
For notational simplicity, in the sequel we drop the variable names, and
denote a value assignment $(X_1 = x_1, X_2 = x_2, \ldots, X_{m-1} = x_{m-1})$ by writing
$(x_1, x_2, \ldots, x_{m-1})$. Now for each possible value $x_{mi} \in \{x_{m1}, \ldots, x_{mn_m}\}$ we
wish to compute the probabilities

$$P(X_m = x_{mi} \mid (x_1, \ldots, x_{m-1}), D).$$

From Bayes’ theorem (Bernardo and Smith, 1994) we know that

$$P(X_m = x_{mi} \mid (x_1, \ldots, x_{m-1}), D) = \frac{p(d[x_{mi}] \mid D)}{\sum_{l=1}^{n_m} p(d[x_{ml}] \mid D)},$$

where $d[x_{mi}]$ denotes the vector $(X_1 = x_1, \ldots, X_{m-1} = x_{m-1}, X_m = x_{mi})$.

Consequently, the conditional distribution for variable $X_m$ can be computed
by using the complete data vector conditional distributions (4) for each of
the possible complete vectors $d[x_{mi}]$. The resulting distribution is called the
predictive distribution of $X_m$.

The derivation of different possibilities as the predictive distribution $p(\cdot)$ in
the case of MDN is somewhat involved and omitted here. The derivations can
be found e.g., in (Heckerman et al., 1995; Kontkanen et al., 1997; Tirri et al.,
1996). For the present purposes it is enough to state that for the restricted case 
of finite mixtures discussed in the previous section, the calculation of the pre- 
dictive distribution can be performed efficiently without any approximations. 
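
The normalization step above can be sketched in Python as follows. This is not the derivation referred to in the cited works; it is a simplified illustration which assumes that all Dirichlet hyperparameters equal a single value `prior`, and that the probability of a complete vector is the class probability times the product of the per-variable conditional probabilities, each estimated by its posterior mean. All names and the tiny data set are hypothetical.

```python
from collections import Counter


def fit_counts(data):
    """Count classes and (class, variable index, value) triples.

    Each row is a tuple (x_1, ..., x_{m-1}, class_label).
    """
    class_counts = Counter(row[-1] for row in data)
    value_counts = Counter()
    for row in data:
        label = row[-1]
        for i, x in enumerate(row[:-1]):
            value_counts[(label, i, x)] += 1
    return class_counts, value_counts


def predictive_distribution(x, data, classes, n_values, prior=1.0):
    """Return P(X_m = c | x, D) for every class c by normalizing p(d[c] | D)."""
    class_counts, value_counts = fit_counts(data)
    n = len(data)
    scores = {}
    for c in classes:
        # Posterior-mean estimate of the class probability.
        score = (class_counts[c] + prior) / (n + prior * len(classes))
        for i, x_i in enumerate(x):
            # Posterior-mean estimate of P(X_i = x_i | X_m = c).
            score *= (value_counts[(c, i, x_i)] + prior) / (
                class_counts[c] + prior * n_values[i]
            )
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Hypothetical sample: two binary predictors and a class label.
data = [(1, 1, "A"), (1, 2, "A"), (2, 2, "B"), (2, 1, "B")]
dist = predictive_distribution((1, 1), data, classes=["A", "B"], n_values=[2, 2])
```

The denominator in the last step is exactly the sum over all completions $d[x_{ml}]$ of the partially observed vector.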



4 Data description 

For our experiments we used three data sets, one from a medical domain (Primary
Tumor), one from chemical analysis (Glass) and an educational data set from
a recent study (Effectiveness). The Primary Tumor data set concerns predicting
the location of the primary tumor, where the location of the cancer is the
group variable. The Glass Identification database (USA Forensic Science Service)
concerns grouping glass samples defined by their oxide content (i.e., Na, Fe, K,
etc.). Both of these data sets are standard benchmarks for comparing different
classification procedures. Since the educational data used is particular to the 
study at hand, we will give a more detailed description of it. 

The educational data used in this study was gathered for the research
project “Effectiveness of Teacher Education in Finland” in spring 1996. The
objective of the project was to evaluate the effectiveness of Finnish teacher
education at various levels from individual to international teacher education
policy. A more detailed description of the framework and research conducted
in the project is given in (Niemi and Tirri, 1996). The data adopted for this
study was gathered to investigate how well teacher education had been able to
achieve the goals set for it. These goals were selected from school law,
programs of teacher education and other documents describing teachers’ work at
school. The teachers and their educators from four different teacher education
departments in Finland were asked to perform self-evaluation on the success
of teacher education in helping teachers to achieve these goals. The
evaluation instrument consisted of 41 behavior statements (and information about
the teacher education department), and used a Likert scale from 1 to 5 for the
assertions. The results of this evaluation study are reported in the forthcoming
study (Niemi and Tirri, 1997).

Data set         Size   #Variables   #Classes (Groups)
Glass             214           10                   6
Primary tumor     339           18                  21
Effectiveness     204           42                   4

Table 1: The description of the data sets used in our experiments.

The data sample used for our comparison is derived from the teachers’
data in the study described above. This data consists of ratings of 204 Finnish
teachers. The subjects were teaching at two levels, roughly half being
elementary school class teachers (N=110) and the other half secondary school subject
teachers (N=94). These teachers came from four different teacher education
departments in three different counties of Finland. The gender distribution was
representative of the Finnish teacher population: 25% were male.

A short description of the data sets used can be found in Table 1.²

² The Primary Tumor and Glass data sets can be obtained from the UCI data repository
at URL address “http://www.ics.uci.edu/~mlearn/”.




5 MDN in classification 



Let us now first study the problem of developing a classification procedure
which would allow us to predict the group to which a given data vector most
likely belongs. For example, for the Effectiveness data set this means developing
a model which would allow us to predict which of the four different
teacher education departments a teacher comes from, based on his/her answers
to the questions. In the application domain this information is interesting for
finding the topics that could be improved in each of the teacher education
departments. Here we will allow the classification procedures to use all the 41
predictor variables in constructing the predictive model, which is atypical of
questionnaire data analysis. In practice, for this type of problem discriminant
analysis is preceded by dimensionality reduction procedures, e.g., factor
analysis, and one would use summarized information such as the factor scores
instead of the primary variables. Knowing the difficult issues related to select- 
ing a proper factor structure, this would, however, introduce another parameter 
to our study, i.e., the discriminative quality of the factor variables constructed. 
Although the analysis is performed at the primary variable level, all discussion 
is naturally valid for discriminant analysis with factor scores also. 

5.1 Testing with sample vs. cross validation 

The traditional classification procedures in linear discriminant analysis typic- 
ally use either the discriminating variables or the canonical discriminant func- 
tions constructed from the data (Klecka, 1981). We assume that the reader 
is familiar with the standard approach as implemented in the SPSS statistical 
software package (Norusis, 1990), and do not repeat the principles here. 

What we are more interested in is the validation of the classification
procedure constructed, either by the Linear Discriminant (LD) or by the Mixture
Density Network approach. For MDN we have described the classification
procedure as the calculation of the predictive distribution $p(d[x_{mi}] \mid D)$, which in
our case depends on parameters $\Theta$. The corresponding notion to our model
$\Theta$ in the linear discriminant analysis are the canonical discriminant functions
(let us denote them by $f_{ld}$). A typical procedure to test the accuracy of $f_{ld}$
is to classify the cases in the sample from which the model was constructed.
We will call this approach training sample validation. The resulting percentage
of correct predictions together with analysis of the difference to the expected
number of correct predictions is then used to quantify the quality of the model.

Although many textbooks include a warning that testing a model with the
same data sample from which it is constructed (see e.g., (Klecka, 1981),
pp. 51-52) gives inflated estimates of the classification performance, this
seems to be the standard practice, unless the data sample is large, in which
case independent samples are sometimes used. In particular, use of k-fold cross
validation (Stone, 1974), sometimes known as the “jackknife”, tends to be very
rare. This is quite concerning, as it is well known that most parameter learning
procedures have a tendency to overfit, i.e., form classification functions that
are more accurate for the sample than they would be for the full population.
In particular we will demonstrate that the classification accuracy of both LD
and MDN is substantially different for the training sample than if measured
by cross validation. The more parameters the model used in the discriminant
analysis has, the more severe this overfitting phenomenon is. With the
exception of the simple model class of perceptrons (Haykin, 1994), all neural 
network model families are highly parameterized nonlinear function estimat- 
ors, and would perform extremely poorly, if the models were selected based on 
their training sample performance. Therefore in the computational intelligence 
community the training sample based validation has been totally replaced by 
other methods such as cross validation based estimation. 
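
The difference between training sample validation and cross validation can be illustrated with a short Python sketch. The classifier below is a deliberately extreme, hypothetical "memorizer" (it is neither LD nor MDN): it stores every training case and falls back on the majority class for unseen cases, so it fits the training sample perfectly while generalizing poorly. The consecutive fold layout is an assumption made to keep the example deterministic; in practice the cases would be shuffled first.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k nearly equal consecutive folds."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_accuracy(rows, labels, fit, predict, k):
    """Fraction of cases classified correctly when each fold is held out in turn."""
    n, correct = len(rows), 0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train = [i for i in range(n) if i not in held_out]
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in test_idx)
    return correct / n


def training_sample_accuracy(rows, labels, fit, predict):
    """Fraction of cases classified correctly by a model built from the same cases."""
    model = fit(rows, labels)
    return sum(predict(model, r) == y for r, y in zip(rows, labels)) / len(rows)


def fit_memorizer(rows, labels):
    """Hypothetical overfitting classifier: lookup table plus majority fallback."""
    return dict(zip(rows, labels)), Counter(labels).most_common(1)[0][0]


def predict_memorizer(model, row):
    table, majority = model
    return table.get(row, majority)


rows = [(i,) for i in range(6)]
labels = ["a", "b", "a", "b", "a", "b"]
train_acc = training_sample_accuracy(rows, labels, fit_memorizer, predict_memorizer)
# Leave-one-out cross validation is k-fold cross validation with k = n.
loo_acc = cross_validated_accuracy(rows, labels, fit_memorizer, predict_memorizer, k=len(rows))
```

Here the training sample accuracy is perfect while the leave-one-out accuracy is zero, an exaggerated version of the gap discussed in this section.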

An interesting question is, why has this not happened in the educational
research community? The answer is intuitively simple, but has important
consequences for the common practices, if neural network models (or actually
any highly parameterized or nonparametric model class) are to be used. The
number of parameters for the hyperplanes used in $f_{ld}$ for low-dimensional data
spaces is so low that the model is not able to overfit much, and thus
automatically shows some generalization to the full population. This can be clearly seen
from Table 2, where the discriminant function model $f_{ld}$ with the variable
selection (5 variables) is only able to fit the training sample to
51% accuracy, with 45.5% performance in leave-one-out cross validation.
Notice that in the 41 variable case we see a difference of 22 percentage points for LD
between the classification in the training sample and cross validation. Naturally MDN
shows the same behavior, although for the more general (semiparametric) MDN
the results would be even more illustrative; for this particular data set we could
reach over 90% accuracy with the training sample with very poor
generalizability when tested with out-of-sample data.

In Table 2 we report the classification accuracy of both the LD and MDN
methods in the case of the data sets described in Section 4. For comparative
purposes, in the Effectiveness case, we have also included the results for a
reduced variable set; the variables were selected by the standard
stepwise selection procedure. In addition it is interesting to observe that, as
opposed to LD, MDN performance improves when the number of variables is
decreased. This is due to the fact that instead of pure discrimination, MDN in
fact tries to model the full joint distribution of the variables and thus has to
balance the predictions for all the variables, not just the group variable.

DATA SET        METHOD                           LD (SPSS)    MDN
Effectiveness   nvs and mdo
                  training sample                  72.0       74.5
                  5-fold crossvalidation           46.0       43.0
                  leave-one-out crossvalidation    50.0       38.5
                vs and mdo
                  training sample                  49.5       58.0
                  5-fold crossvalidation           45.0       44.5
                  leave-one-out crossvalidation    44.0       45.0
                nvs and mdi
                  training sample                  69.0       75.0
                  5-fold crossvalidation           44.5       39.5
                  leave-one-out crossvalidation    48.5       39.5
                vs and mdi
                  training sample                  51.0       57.0
                  5-fold crossvalidation           44.0       45.5
                  leave-one-out crossvalidation    45.5       45.0
Primary Tumor   nvs and mdi
                  training sample                  48.0       56.9
                  leave-one-out crossvalidation    36.0       49.0
Glass           nvs and mdi
                  training sample                  64.5       79.0
                  leave-one-out crossvalidation    60.3       70.1

Table 2: The comparison of the classification performance of the linear
discriminant functions and Mixture Density Networks. The options “nvs” and
“vs” denote that no variable selection/variable selection was used, i.e., 41
predictor variables/5 predictor variables were used. Options “mdo” and “mdi”
correspond to omitting data with missing values and including missing value
as a value in the analysis, respectively. The numbers represent the percentage
of correctly classified cases.

From the above discussion we would like to stress that reporting the classi- 
fication performance in the training sample is in most cases quite misleading, 
and definitely not to be used with more complex model families such as neural 
networks. 

5.2 Classification performance vs. training sample size 

In the previous section we saw that for the Effectiveness data set LD
outperforms the MDN in the cross validated error rate when all the 41 predictor
variables are used, and that for the 5 predictor variable case both methods
showed equal performance. On the other hand for the Primary Tumor and
Glass data sets MDN clearly outperforms the standard LD. Let us now study 
what happens to the performance of these two methods as a function of training 
sample size. 

In this type of experiment one randomly partitions the sample into a training
sample reservoir $D_r$ containing 70% of the sample, and a test set $D_q$ containing
the remaining 30%. One data vector $d_1$ is then randomly taken out of the training
reservoir and used as a training sample $D_1 = \{d_1\}$. This initial training sample
$D_1$ is used to construct the model, which is used to classify all $d_j \in D_q$, and the
predictions thus obtained for each $d_j$ are then compared to the actual outcomes
$k$.

Next the training set $D_1$ is extended by another data vector $d_2$, distinct
from the element already in $D_1$ but otherwise randomly picked from the
training reservoir $D_r$. This new training set is denoted by $D_2$. After building
the new model, all $d_j \in D_q$ are classified again, and the results are compared
to the actual outcomes. This procedure of adding one training element to $D_i$
to form $D_{i+1}$, determining the model using $D_{i+1}$ and predicting the value of the
group variable for each entry in the test set is then repeated until $D_{i+1} = D_r$,
i.e., until the training set contains the full reservoir. This whole procedure is then
repeated 10 times with another split for the training sample. Figure 1 gives the
average (over 10 repetitions) performance of Mixture Density Networks and the linear
discriminant methods as a function of the training sample size. Here we can see that
both of the methods show quite similar small sample performance, the shapes
of the curves being almost identical. However, this type of analysis tells
us about the complexity of modeling the data set with the model classes given.
We can see that both methods asymptotically approach a success rate of
40-50%, which is not particularly high.

Figure 1: The classification performance of MDN and LD as a function of the
sample size for the Effectiveness data set.
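
The incremental procedure described in this section can be sketched in Python as follows. The sketch assumes a generic `fit`/`predict` pair; the majority-class classifier and the toy data are hypothetical stand-ins for LD or MDN, and the seeds are arbitrary. Shuffling the indices once and growing the training set along that order is equivalent to repeatedly drawing one new, distinct element from the reservoir.

```python
import random
from collections import Counter


def learning_curve(rows, labels, fit, predict, train_fraction=0.7, seed=0):
    """Test-set accuracy after each single-case extension of the training set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = int(round(train_fraction * len(indices)))
    reservoir, test = indices[:n_train], indices[n_train:]
    accuracies = []
    for size in range(1, len(reservoir) + 1):
        train = reservoir[:size]  # training set D_size
        model = fit([rows[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, rows[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies


def averaged_learning_curve(rows, labels, fit, predict, repetitions=10):
    """Average the curves over several random reservoir/test splits."""
    curves = [
        learning_curve(rows, labels, fit, predict, seed=s) for s in range(repetitions)
    ]
    return [sum(values) / repetitions for values in zip(*curves)]


def fit_majority(rows, labels):
    """Hypothetical stand-in classifier: always predict the majority class."""
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, row):
    return model


rows = list(range(20))
labels = ["a"] * 12 + ["b"] * 8
curve = averaged_learning_curve(rows, labels, fit_majority, predict_majority)
```

Plotting `curve` against the training sample size gives the kind of graph summarized in Figure 1.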

6 MDN in exploratory analysis 

In standard discriminant analysis, once the canonical discriminant functions 
have been derived, one can try to interpret their meaning. This is typically done 
by examining the relative positions of the data cases and group centroids, and 
by studying the relationships between the individual variables and the functions. 
One of the methods is to study the structure coefficients of the variables to see
how much a variable $X_i$ has in common with a discriminant function $f_{ld}$. In
our MDN approach the corresponding notion would be the Kullback-Leibler
distance of the unconditional and conditional marginal likelihood of $X_i$, i.e.,

$$D_{KL}(p(X_i \mid X_m = k, \Theta), p(X_i \mid \Theta)),$$

where $D_{KL}(p, q)$ is the relative entropy between $p$ and $q$ (Cover and Thomas,
1991). Similarly the corresponding notion to Wilks' lambda is the relative
entropy between the unconditional and conditional joint distributions, i.e.,

$$D_{KL}(p(X \mid X_m = k, \Theta), p(X \mid \Theta)).$$

Figure 2: A snapshot of the interface of the NOME software tool.
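
For discrete distributions the relative entropy used above is a simple sum, as the following Python sketch shows. The two three-valued distributions are hypothetical, and the natural logarithm (nats) is an assumed choice of unit; the computation requires $q(x) > 0$ wherever $p(x) > 0$.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0.0)


# Hypothetical marginals of one variable X_i:
conditional = [0.5, 0.25, 0.25]    # p(X_i | X_m = k)
unconditional = [0.25, 0.25, 0.5]  # p(X_i)
d_kl = kl_divergence(conditional, unconditional)
```

A value near zero would indicate that knowing the group tells us little about $X_i$; larger values mark variables whose distribution differs more between the group and the whole sample.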

MDN networks model the joint probability distribution of the variables
$X_1, \ldots, X_m$. Once we have built our model $\Theta$, we can in fact explore the
predictive (marginal) distribution of any variable $X_i$ given the values of other
variables, not just the group variable $X_m$. Modeling the full joint distribution
gives us an extremely powerful exploratory tool; here we only want to briefly
address some of the questions that can be answered by such a tool:

• Variable predictive distributions for a given group. In the extreme 
case we can fix in the data vector only the value of the group variable 
$X_m$, after which the MDN can calculate all the marginal predictive
distributions. This means that one can study the distribution of any variable
conditioned by the fact that the data vector d belongs to the group. For 
example in our Effectiveness data set we can fix one teacher education de- 
partment value, and then explore what is the predicted attitude towards 
readiness for multimedia teaching for teachers that graduated from that 
particular department. 

• Variable predictive distribution of the group variable given some 
combination of other variable values. We can reverse the situation 
in the previous item, and explore the effect of some value combination of 
variables $X_i, X_j, \ldots$ for predicting the group. Again, to give an example,
we could explore which of the teacher education departments seems to 
have given the least readiness to teachers for using computers and multi- 
media in their teaching. 

• Variable predictive distribution of a non-group variable given 
some combination of other variable values. Similarly, based on the 
implicit clustering induced by the group variable, one could also explore 
the predictive distribution of any non-group variable $X_i$ given the values
of some other non-group variables $X_j, X_k$, etc., without fixing the group
value. 
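
The three kinds of queries listed above can be sketched in Python for a model in which the predictors are independent given the group variable. The probability tables, department names and variable indices below are hypothetical, as are the function names; the sketch only illustrates how such queries reduce to sums and products over the model's tables.

```python
def group_posterior(evidence, group_probs, cond_probs):
    """P(X_m = c | evidence), where evidence maps a variable index to its value."""
    scores = {}
    for c, p_c in group_probs.items():
        score = p_c
        for i, value in evidence.items():
            score *= cond_probs[c][i][value]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def variable_predictive(target, evidence, group_probs, cond_probs):
    """P(X_target = v | evidence) for each v, summing over the unknown group."""
    posterior = group_posterior(evidence, group_probs, cond_probs)
    values = next(iter(cond_probs.values()))[target].keys()
    return {
        v: sum(posterior[c] * cond_probs[c][target][v] for c in posterior)
        for v in values
    }


# Hypothetical model: two departments and two questionnaire items (indices 0, 1).
group_probs = {"dept1": 0.5, "dept2": 0.5}
cond_probs = {
    "dept1": {0: {"low": 0.75, "high": 0.25}, 1: {"low": 0.5, "high": 0.5}},
    "dept2": {0: {"low": 0.25, "high": 0.75}, 1: {"low": 0.25, "high": 0.75}},
}

given_group = cond_probs["dept1"][0]                                  # first query type
group_given_item = group_posterior({0: "low"}, group_probs, cond_probs)  # second
item_given_item = variable_predictive(1, {0: "low"}, group_probs, cond_probs)  # third
```

In the third query the group is never fixed; it acts only as the implicit clustering through which one item informs the other.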

The MDN based approach has been implemented and runs on a Pentium
PC under the Linux operating system. Figure 2 illustrates the experimental
software tool called NONE, which provides a flexible graphical interface for
building MDN models and exploring the predictive distributions. NONE is
programmed in Java, and thus can be used with any Java compatible Internet
browser. A running Java™ demo of the software can be accessed through our
WWW homepage at URL “http://www.cs.Helsinki.FI/research/cosco/”.
7 Conclusion



In this paper we have discussed some of the methodological issues of using a 
class of neural networks, called Mixture Density Networks, for discriminant 
analysis. We demonstrated that, as opposed to many other neural network 
models, Mixture Density Networks have the advantage of having a rigorous 
probabilistic interpretation, and thus the resulting models can also be used for 
explorative purposes. In addition MDN have proven to be a viable alternative 
as a classification procedure in discrete domains, which is supported by the 
results in the empirical part of our work. The use of full joint probability 
models in discriminant analysis raises interesting methodological questions, 
some of which were addressed in our discussion. This paper has discussed
ongoing research, and a more extensive theoretical and experimental treatment,
e.g., in the context of factor analysis, is a topic for future work.